GENERATIVE / DISCRIMINATIVE APPROACH TO MONITORING TRANSACTIONAL DIALOGUE STATES BY COLLECTIVE MATRIX FACTORIZATION
Abstract:
A dialog state tracking method employs first and second latent variable models that have been learned by reconstructing a decompositional model generated from annotated training dialogues. The decompositional model includes, for each of a plurality of dialog state transitions corresponding to a respective turn of one of the training conversations, state descriptors for the initial and final states of the transition and a representation of the conversation for that turn. The first latent variable model includes embeddings of the plurality of state transitions, and the second latent variable model includes feature embeddings of the state descriptors and feature embeddings of the conversation representations. Data is received for a new dialog state transition that includes a state descriptor for the initial state and a respective conversation representation. A state descriptor for the final state of the new dialog state transition is predicted using the learned latent variable models.
Publication number: FR3041791A1
Application number: FR1658500
Filing date: 2016-09-13
Publication date: 2017-03-31
Inventor: Julien Perez
Applicant: Xerox Corp
IPC main class:
Description:
The software may comprise a non-transitory computer-readable recording medium on which a control program is recorded (stored), such as a disk, a hard drive, or the like. Common forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape or any other magnetic storage medium, CD-ROMs, DVDs or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM or other memory chip or cartridge, or any other non-transitory medium that a computer can read and use. The software may be integral with the computer 18 (for example, an internal hard drive or RAM), or may be separate (for example, an external hard drive operatively connected to the computer 18), or may be separate and accessible via a digital data network such as a local area network (LAN) or the Internet (for example, a redundant array of inexpensive or independent disks (RAID) or other network server storage that is indirectly accessible by the computer 18 via a digital network). Alternatively, the method may be implemented in transitory media, such as a transmissible carrier in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like. The exemplary method may be implemented on one or more general purpose computers, special purpose computers, a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, graphics processing unit (GPU), or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the flowchart shown in Figure 2 and/or 4 can be used to implement the method. As will be appreciated, while the steps of the method may all be computer-implemented, in some embodiments one or more of the steps may be at least partially performed manually. As will also be appreciated, the steps of the method need not all proceed in the order illustrated, and fewer, more, or different steps may be performed. The exemplary method and system provide advantages over existing systems (1) by producing a joint probability modeling of the transition of the hidden variables composing a given dialogue state and of the observations, which makes it possible to track the current belief in the user's goals while explicitly taking into account the possible interdependencies between the state variables; and (2) by providing a computational framework, based on collective matrix factorization, for efficiently inferring the distribution over the state variables in order to derive an adequate information-seeking dialogue policy in such a context.
While transactional dialogue tracking is mainly useful in the context of autonomous dialogue management, the system can also be applied to dialogue machine reading and to knowledge extraction from human-to-human dialogue corpora. Without intending to limit the scope of the exemplary embodiment, the following examples illustrate the application of the method to a pre-existing dataset.
EXAMPLES
The dialogue domain used for the evaluation of the dialogue tracker, and the component probability models used for that domain, are first described. Then, a set of experimental results obtained with the tracker and a comparison with existing trackers are described. The DSTC-2 dialogue domain described in Williams, et al., "The dialog state tracking challenge," Proc. SIGDIAL 2013 Conf., pp 404-413, 2013, was used. In this domain, the user queries a database of local restaurants. The dataset for the restaurant information domain was collected using Amazon Mechanical Turk. A typical dialogue is as follows: first, the user specifies his personal set of constraints concerning the restaurant. Then, the system offers the name of a restaurant that satisfies the constraints. The user then accepts the offer, and may request additional information about the accepted restaurant. The dialogue ends when all the information requested by the user has been provided. In this context, the dialogue state tracker must be able to track three types of information that compose the dialogue state: the geographic area, cuisine type, and price range windows. As will be appreciated, the tracker can easily be set up to track other variables as well, provided they are fully specified. The dialogue state tracker updates its belief turn by turn, receiving evidence from the NLU module as the actual utterance is produced by the user. In this experiment, the output of the NLU module was restricted to a bag-of-words representation of each user utterance, in order to be comparable with existing state tracking approaches that use only such information as evidence. The task of the dialogue state tracker 32 is to generate a set of possible states and their confidence scores for each window, the confidence score corresponding to the posterior probability of each state variable. In addition, the dialogue state tracker also maintains a special state value, called "None", which indicates that the true goal state has not yet been observed. Experimental state tracking results were obtained for this dataset and compared with existing generative and discriminative approaches. Table 1 gives the variables and the domain of expression for each of them. TABLE 1: DSTC-2 information windows (restaurant information domain). Table 2 details the precision-at-position (P@n) performance results obtained on the DSTC-2 dataset for a set of embedding dimensions of the collective matrix factorization model. The model manages to accurately determine a small subset of hypotheses in which the correct instantiation is present for each variable.
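For reference, the precision-at-position (P@n) metric reported in Table 2 can be computed as sketched below; this is a generic illustration with toy data, not the exact DSTC-2 scoring script.

```python
def precision_at_n(ranked_hypotheses, correct_value, n):
    """P@n: 1 if the correct value appears among the top-n ranked hypotheses."""
    return 1.0 if correct_value in ranked_hypotheses[:n] else 0.0

# Average P@2 over turns for one tracked variable (toy data).
turns = [(["spanish", "italian", "thai"], "italian"),
         (["thai", "spanish", "french"], "spanish")]
score = sum(precision_at_n(h, c, n=2) for h, c in turns) / len(turns)
print(score)   # 1.0 in this toy example
```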
Then, for comparison with existing methods, Table 3 presents the accuracy results of the best CMF model, with an embedding dimension of 100, where the value of each window is instantiated as the most probable one with respect to the inference procedure described above. Results are reported for several existing generative and discriminative state tracking methods on this dataset. More precisely, as prescribed by these approaches, the accuracy score computes p(s_{t+1} | s_t, z_t). The following existing trackers were compared:
1. A rule-based system described in Zilka, et al., "Comparison of Bayesian Discriminative and Generative Models for Dialogue State Tracking," Proc. SIGDIAL 2013, pp 452-456, 2013.
2. An HMM model (HWU) described in Wang, "HWU baseline belief tracker for DSTC 2 & 3," Technical Report, Heriot-Watt University, 2013.
3. A modified HMM model (HWU+) described in Wang 2013.
4. A maximum entropy model (MaxEnt) described in Lee, et al., "Recipe for building robust spoken dialog state trackers: Dialog state tracking challenge system description," Proc. SIGDIAL 2013, pp 414-422, 2013, which is a type of discriminative model.
5. A deep neural network (DNN) architecture as described in Henderson, et al., "Word-based dialog state tracking with recurrent neural networks," Proc. SIGDIAL, pp 296-299, 2014.
6. CMF, the present method, where the number that follows represents the size of the embedding vectors in each of the matrices B, C. Thus, for example, CMF-100 indicates that the embedding matrix has 100 hidden variables in each row and the matrices B, C have 100 hidden variables in each column.
Table 2: Accuracy of the proposed model compared with other trackers on the DSTC-2 dataset. Table 3: 10-fold cross-validated accuracy results obtained for each of the three simultaneously tracked variables (area, cuisine type and price range) of the DSTC-2 dataset. The results suggest that the exemplary system and method enable efficient dialogue state tracking in the transactional context of autonomous dialogue systems. The results suggest that consumer services, and more broadly chat automation agent platforms, will be able to handle dialogue management more efficiently using the system and method. Indeed, questions related to contracting, billing, and device insurance management can be automated using such a framework. The system and method are obviously applicable to any dialogue domain that can be formalized as a type of window-filling task. More precisely, such a system enables efficient tracking of the hidden variables defining the user's goal in a task-oriented dialogue, using any type of available evidence, from the bag-of-words utterance to the output of a Natural Language Understanding (NLU) module.
GENERATIVE / DISCRIMINATIVE APPROACH TO MONITORING TRANSACTIONAL DIALOGUE STATES BY COLLECTIVE MATRIX FACTORIZATION
The exemplary embodiment relates to dialogue systems and finds particular application in the context of a system and method for monitoring a dialogue state by collective matrix factorization. Automated dialogue systems interact with users through natural language to help them achieve a goal.
For example, a user may be interested in finding a restaurant and may have a set of constraints, such as geographic location, date, and time. The system suggests the name of a restaurant that satisfies the constraints. The user can then request additional information about the restaurant. The conversation continues until the user's questions have been answered. There are many other applications in which dialogue systems would be advantageous. For example, in the context of customer care, effective automation could bring increased productivity by increasing the likelihood of success of each call while reducing the overall cost. The use of autonomous dialogue systems is growing rapidly with the spread of smart mobile devices, but it still faces challenges in becoming a primary user interface for natural interaction through conversations. In particular, when the dialogues take place in noisy environments or when the utterances themselves are noisy, it can be difficult for the system to recognize or understand what the users say. Dialogue systems often include a dialogue state tracker that monitors the progress of the conversation (dialogue and conversation are used interchangeably here). The dialogue state tracker provides a compact representation of user input and system output in the form of a dialog state. The dialog state encapsulates the information needed to successfully complete the conversation, such as the user's goal or request. The phrase "dialogue state" refers loosely to a representation of the knowledge of the user's needs at any point in a conversation. The precise nature of the dialog state depends on the associated dialog task. An effective dialogue system benefits from a state tracker that is able to accumulate evidence, in the form of observations, accurately over the sequence of turns of a conversation, and to adjust the dialogue state according to the observations. However, in spoken dialogue systems, in which the user's utterance is input as a voice recording, the fact that automatic speech recognition (ASR) and natural language understanding (NLU) may be error-prone means that the true utterances of the user are not directly observable. This makes it difficult to compute the true dialogue state. A common mathematical representation of a dialog state is a window filling scheme. See, for example, Williams, et al., "Partially observable Markov decision processes for spoken dialog systems," Computer Speech & Language, 21(2):393-422, 2007, hereinafter "Williams 2007". In this approach, the state is composed of a predefined set of variables with a predefined domain of expression for each of them. The goal of the dialogue system is to instantiate each of the variables efficiently in order to execute an associated task and satisfy the corresponding intention of the user. In the restaurant case, for example, this may include, for each of a set of variables, a most likely value of the variable, such as: location: downtown; date: August 14; hour: 19:30; restaurant type: Spanish (or unknown if the variable has not been assigned a value). Various approaches have been proposed for defining dialogue state trackers. Some systems use hand-crafted rules that rely on the most likely result of an NLU module. However, these rule-based systems are prone to frequent errors, since the most likely result is not always the correct one.
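As a concrete illustration of the window filling scheme described above, the sketch below shows one possible way such a state could be represented in code. It is a minimal, hypothetical sketch: the variable names, domains and values are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of a window-filling dialog state (hypothetical names/values).
# Each window (state variable) has a predefined domain and is either
# instantiated with its most likely value or left as "unknown" (None).

from typing import Dict, Optional

RESTAURANT_DOMAINS: Dict[str, Optional[list]] = {
    "location": ["downtown", "north", "south"],
    "date": None,              # free-form value
    "hour": None,              # free-form value
    "restaurant_type": ["spanish", "italian", "thai"],
}

def new_state() -> Dict[str, Optional[str]]:
    """Start a dialogue with every window uninstantiated."""
    return {window: None for window in RESTAURANT_DOMAINS}

state = new_state()
state.update({"location": "downtown", "date": "August 14",
              "hour": "19:30", "restaurant_type": "spanish"})
print(state)   # windows still equal to None are treated as "unknown"
```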
In addition, these rule-based systems often lead the customer to respond using simple keywords and to confirm everything they say explicitly, which is far from a natural conversational interaction. See, Williams, "Web-style ranking and SLU combination for dialog state tracking," Proc. SIGDIAL, pp 282-291, June 2014. More recent methods take a statistical approach to estimating the posterior distribution over dialog states using the results of the NLU step. Statistical dialogue systems, by maintaining a distribution over several hypotheses of the actual dialogue state, are able to behave robustly when confronted with noisy conditions and ambiguity. Statistical dialogue state trackers can be classified into two general approaches (generative and discriminative), depending on how the computation of the posterior probability distribution over the state is modeled. The generative approach uses a generative model of dialogue dynamics that describes how the NLU results are generated from the hidden dialog state, and uses the Bayes rule to compute the posterior probability distribution. The generative approach is a common approach for statistical dialogue state tracking, as it integrates naturally into the Partially Observable Markov Decision Process (POMDP) type of modeling, which is an integrated model of dialogue state tracking and dialogue strategy optimization. See, Young, et al., "POMDP-based statistical spoken dialog systems: A review," Proc. IEEE, 101(5):1160-1179, 2013. In the POMDP context, the dialog state tracking task is to compute the posterior distribution over the hidden states, taking into account the history of the observations. The discriminative approach aims to model the posterior distribution directly through an algebraic closed-form formulation of a loss minimization problem. Generative systems are described, for example, in Williams 2007; Williams, "Exploiting the ASR N-best by tracking multiple dialog state hypotheses," INTERSPEECH, pp 191-194, 2008; Williams, "Incremental partition recombination for efficient tracking of multiple dialog states," ICASSP, pp 5382-5385, 2010; and Thomson, et al., "Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems," Computer Speech & Language, 24(4):562-588, 2010, hereinafter "Thomson 2010". Discriminative systems are described, for example, in Paek, et al., "Conversation as action under uncertainty," UAI '00: Proc. 16th Conf. on Uncertainty in Artificial Intelligence, pp 455-464, 2000, and in Thomson 2010. The successful use of discriminative models for belief tracking has recently been reported in Williams, "Challenges and opportunities for state tracking in statistical spoken dialog systems: Results from two public deployments," IEEE J. Sel. Topics in Signal Processing, 6(8):959-970, 2012; and Henderson, et al., "Deep Neural Network Approach for the Dialog State Tracking Challenge," Proc. SIGDIAL 2013, pp 467-471, 2013. Each of these statistical approaches suffers from certain limitations, such as complex inference at test time, scalability, or restrictions on the set of possible state variables during learning. In accordance with one aspect of the exemplary embodiment, a dialog state tracking method includes providing first and second latent variable models that have been learned by reconstructing a decompositional model.
The decompositional model is one that has been generated from annotated training conversations and includes, for each of a plurality of dialog state transitions, state descriptors for the initial and final states of the transition and a respective conversation representation. The first learned latent variable model includes embeddings of the plurality of state transitions. The second learned latent variable model includes feature embeddings of the state descriptors and feature embeddings of the conversation representations. Data for a new dialog state transition is received. The data includes a state descriptor for the initial state and a respective conversation representation. A state descriptor for the final state of the new dialog state transition is predicted based thereon, using the first and second learned latent variable models. The prediction of the state descriptor can be performed with a processor. In accordance with another aspect of the exemplary embodiment, a dialog state tracking system includes a memory that stores first and second latent variable models that have been learned by reconstructing a decompositional model. The decompositional model has been generated from annotated training conversations and includes, for each of a plurality of dialog state transitions, state descriptors for the initial and final states of the transition and a respective conversation representation. The first learned latent variable model includes embeddings of the plurality of state transitions. The second learned latent variable model includes feature embeddings of the state descriptors and feature embeddings of the dialogue representations. An information gathering component receives utterances from a user for each of a plurality of new dialog state transitions. A representation generation component generates a conversation representation based on the user's utterances. A prediction component predicts a state descriptor for the final state of each new dialog state transition using the learned first and second latent variable models, the respective conversation representation, and a corresponding initial state descriptor. A processor implements the information gathering component, the representation generation component, and the prediction component. According to another aspect of the exemplary embodiment, a method of identifying a transaction includes learning first and second latent variable models for reconstructing a decompositional model, the decompositional model having been generated from annotated training conversations and comprising, for each of a plurality of dialog state transitions, state descriptors for the initial and final states of the transition and a respective conversation representation. The first learned latent variable model includes embeddings of the plurality of state transitions, and the second learned latent variable model includes feature embeddings of the state descriptors and feature embeddings of the dialogue representations. For each of a plurality of turns of a conversation, the method includes receiving data for a new dialog state transition, the data including a state descriptor for the initial state and a respective conversation representation, predicting a state descriptor for the final state of the new dialog state transition using the learned first and second latent variable models, and generating an agent dialog act based on the predicted state descriptor.
Based on the predicted final state of at least one of the turns of the conversation, a transaction to be implemented is identified. Fig. 1 is a functional block diagram of a dialogue system according to one aspect of the exemplary embodiment; Fig. 2 is a flowchart illustrating a dialogue tracking method according to another aspect of the exemplary embodiment; Figure 3 illustrates a probabilistic graphical model of latent variables (directed model) for spectral state tracking in a transactional conversation; Figure 4 illustrates a corresponding factor graph for the model of Figure 3; Figure 5 illustrates the learning of a generative model in the method of Figure 2; Fig. 6 illustrates an exemplary spectral state tracking model in which collective matrix factorization is applied as an inference procedure; and Figure 7 illustrates the generation of the matrix M in the exemplary prediction method. Aspects of the exemplary embodiment relate to a dialogue state tracking system and method that estimate the actual dialogue state of an ongoing conversation from noisy observations produced by the speech recognition and/or natural language understanding modules. The exemplary system and method allow statistical dialogue state tracking based on a joint probabilistic model that yields a collective matrix factorization inference scheme. The dialogue state tracker performs well in comparison with existing approaches. The prediction scheme is also computationally efficient compared with existing approaches. The method includes tracking a posterior distribution over hidden dialog states composed of a set of context-dependent variables. A dialog policy, once learned, aims to select an optimal system action, given the estimated dialog state, by optimizing a defined reward function. A type of discriminative/generative approach for state tracking is described here that uses spectral decomposition methods and associated inference procedures. The exemplary probabilistic model jointly estimates the state transition with respect to a set of observations. In an exemplary embodiment, the state transition is computed by an inference procedure having a linear complexity with respect to the number of variables and observations. Referring to Figure 1, there is illustrated a transactional dialogue system 10 for performing a transaction by automated analysis of user conversations. User conversations can be textual or spoken (voice) conversations, or a combination of these. The system 10 includes a memory 12 which stores instructions 14 for carrying out the method illustrated in FIG. 2 and a processor 16 in communication with the memory for executing the instructions. The system 10 may include one or more computing devices 18, such as the illustrated server computer. One or more input/output devices 20, 22 allow the system to communicate with external devices, such as the illustrated client device 24, via wired or wireless connections, such as the Internet 26. The hardware components 12, 16, 20, 22 of the system communicate via a data/control bus 28. The computer instructions 14 include a dialog tracking learning component 30 and a dialog tracking prediction component 32, referred to herein as spectral state tracking or SST. The system may further include an intent detection component 34, an information collection component 36, a window filling component 38, an utterance representation component 40, and an execution component 42.
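The sketch below outlines, in code, one plausible way the components enumerated above (items 30-42) could be organized. The class and method names are hypothetical and are only intended to make the division of responsibilities concrete; they are not the patent's API.

```python
# Hypothetical skeleton mirroring the components of Figure 1 (names are illustrative).
class IntentDetector:            # component 34
    def detect(self, utterance: str) -> str: ...

class InformationCollector:      # component 36 (wraps ASR/NLU, asks questions)
    def next_system_act(self, state: dict) -> str: ...
    def parse_user_act(self, utterance: str) -> dict: ...

class UtteranceRepresenter:      # component 40 (e.g., bag-of-words of the turn)
    def represent(self, system_act: str, user_act: str) -> list: ...

class SpectralStateTracker:      # components 30/32 (learning + prediction)
    def learn(self, annotated_dialogues): ...
    def predict(self, state_t: dict, turn_repr: list) -> dict: ...

class TransactionExecutor:       # component 42
    def execute(self, final_state: dict): ...
```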
Briefly, the learning component 30 learns a generative model 50 which forms the basis of the spectral state tracker 32. Learning is performed on a collection 52 of annotated training dialogs using Collective Matrix Factorization (CMF). In particular, a decompositional model 54, which includes state descriptors and corresponding dialogue representations for each of a set of turns, is used to learn a plurality of coupled temporal hidden (latent) variable models A, B, and C 56, 58, 60. Model A includes embeddings of the observed dialog state transitions, model B includes embeddings of each of the state descriptor features, and model C includes embeddings of each of the dialogue representation features. Models B and C can be combined into a single model D 61. Given the learned generative model 50, when new utterances 62 are received at an initial time t, the spectral state tracker 32 updates the generative model 50 and predicts the dialog state at a final (next) time t+1. If there is more than one type of transaction handled by the system, the intention detection component 34 identifies the user's intention in order to determine the set of variables to be instantiated in the generative model 50. By way of example, the user of the client device 24 can express an intention to book a flight, and the intention detection component 34 identifies the variables, e.g., the destination, the date, and the time, each of which must be instantiated from a respective set of predefined values. The variables correspond to the windows to be filled by the window filling component 38 using information from the SST 32. The information collection component 36 implements an information collection policy that automatically generates virtual agent dialog acts 64, such as responses to the user's utterances 62. These system-generated dialog acts can seek confirmation of what the user is assumed to have asked in a previous utterance, or can solicit new information. The utterance representation component 40 generates a representation 66 of the dialogue at an initial time t, which is one of the observations that the SST 32 uses to predict the dialogue state at the next time t+1. The execution component 42 executes the task identified from the dialogue, for example, booking a flight or a restaurant for the user in the illustrative examples. The computer system 10 may include one or more computing devices 18, such as a desktop computer, a laptop, a handheld computer, a personal digital assistant (PDA), a server computer, a cellular phone, a tablet computer, a pager, a combination thereof, or another computing device capable of executing instructions for implementing the exemplary method. The memory 12 may represent any type of non-transitory computer-readable medium such as random access memory (RAM), read-only memory (ROM), a magnetic disk or tape, an optical disk, flash memory, or holographic memory. In one embodiment, the memory 12 comprises a combination of random access memory and read-only memory. In some embodiments, the processor 16 and the memory 12 may be combined on a single chip. The memory 12 stores instructions for implementing the exemplary method, as well as the processed data 50, 66. The network interface 18 allows the computer to communicate with other devices via a computer network, such as a local area network (LAN) or WAN, or the Internet, and may include a modulator/demodulator (MODEM), router, cable, and/or Ethernet port.
The digital processor device 16 can be provided in various forms, such as a single-core processor, a dual-core processor (or more generally a multi-core processor), a digital processor with a cooperating math coprocessor, a digital controller, or the like. In addition to executing instructions 14, the digital processor 16 may also control the operation of the computer 18. The term "software" as used herein is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the object of the software. The term "software" as used herein is also intended to encompass instructions stored in a storage medium such as a RAM, a hard disk, an optical disk, etc., and is further intended to encompass so-called "firmware", which is software stored on a ROM or the like. Such software can be organized in different ways, and can include software components organized into libraries, Internet-based programs stored on a remote server, etc., source code, interpretive code, object code, directly executable code, etc. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions. As will be appreciated, FIG. 1 is a high-level functional block diagram of only a portion of the components that are incorporated into a computer system 10. Since the configuration and operation of programmable computers are well known, they will not be described in detail.
Transactional dialogue state tracking
The dialogue state tracking task of interest can be formalized as follows: at each turn of the task-oriented conversation, the information collection component 36 of the dialogue system 10 selects a system dialogue act 64, denoted d_t^s, to express, and the user responds to the system with a user utterance 62, denoted u_t. The dialog state at each turn of a given dialog is defined as a distribution over a set of predefined variables to track that defines the structure of the dialog state. The construction of the dialog state is called window filling. In a transactional dialog, the state tracking task involves estimating the value of each of a set of predefined variables in order to perform a procedure associated with the task to which the dialog is assumed to correspond. In an exemplary embodiment, when the utterances are spoken (voice), the information gathering component 36 includes an NLU module 70 which processes the user's utterance 62 and generates an N-best list: o = {<d1, f1>, ..., <dN, fN>}, where di is a hypothesized user dialog act and fi is its associated confidence score. Each hypothesized user dialog act is a sequence of words (or, more generally, tokens) predicted to match the user's utterance. The NLU module 70 can receive as input the output of an automatic speech recognition (ASR) module 72, which converts the spoken utterance 62 into text. In a text-based dialogue system, where the utterances are in the form of text strings, the ASR module and possibly also the NLU module can be omitted, and the text string(s) are considered as the user dialogue act d. The representation 66 generated by the utterance representation generator 40 may include a bag-of-words representation of the respective turn of the conversation.
The bag-of-words representation may comprise, for each of a set of words, a value representing the presence or absence of the word in the user dialogue act d (and possibly also in the corresponding system dialogue act). In the simple case, this can be considered as the only evidence on which the representation 66 is based. However, if an NLU module 70 is available, standardized dialogue act schemas can be considered as the evidence (or part of it) on which the representation is based. See, for example, Bunt, et al., "Towards an ISO standard for dialogue act annotation," Proc. 7th Int'l Conf. on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), pp 2548-2555, 2010. In one embodiment, if prosodic information (for example, intonation, tone, emphasis and/or rhythm information for the user's utterance 62) is available in the output of the ASR system 72, it can also be considered as evidence. See, Milone, et al., "Prosody and Emphasis Information for Automatic Speech Recognition," IEEE Trans. on Speech and Audio Processing, 11(4):321-333, 2003. The statistical dialogue state tracking model 50 maintains, at each discrete time t, the probability distribution over the state, b(s_t), called the belief in the state. The window filling process of a transactional type of dialog management method is summarized in Figure 2. The method starts at S100. At S102, the tracking models A, B, C, and M are learned by the dialog tracking learning component 30 using annotated sequences of dialog turns with their corresponding state descriptors. At S104, in a new dialog, the intention of the user can be detected by the intention detection component 34. Intention detection is usually an NLU problem resulting in the identification of the task that the user wishes the system to accomplish. This step determines the set of variables to be instantiated during the window filling process (S106). Dialogue management assumes that a set of variables is needed for each predefined intent. The dialog management window filling process (S106) includes the dual and sequential tasks of information gathering (S108) and dialog state tracking (S110). These are performed substantially iteratively until the predefined windows are each filled, for example, with a respective most probable value of the respective variable having at least a threshold probability score. Once all the variables have been properly instantiated, as in existing dialog systems, a final general confirmation of the task desired by the user is made (S112) before executing the requested task (S114). The method ends at S116. As noted above, two different statistical approaches have been used to maintain the distribution over the state given the sequential NLU outputs. The discriminative approach aims at modeling the posterior probability distribution of the state at time t+1 given the state at time t and the observations z_{1:t}. The generative approach aims to model the transition probability and the observation probability in order to exploit the possible interdependencies between the hidden variables that make up the dialogue state. A description of these existing methods follows, before describing the exemplary learning and prediction methods of spectral state tracking.
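The window filling flow (S104-S114) summarized above can be sketched, very roughly, as the following turn-level loop. It is a hypothetical outline (the function names and confidence threshold are assumptions for illustration, not the patent's API), not a definitive implementation.

```python
# Hypothetical sketch of the window-filling dialog management loop (S104-S114).
THRESHOLD = 0.8   # assumed minimum confidence before a window counts as filled

def run_dialog(tracker, collector, representer, executor, domains):
    intent = collector.detect_intent()                   # S104
    state = {window: None for window in domains[intent]}
    while any(v is None for v in state.values()):        # S106 window filling
        system_act = collector.next_system_act(state)    # S108 information gathering
        user_act = collector.parse_user_act()
        turn_repr = representer.represent(system_act, user_act)
        belief = tracker.predict(state, turn_repr)       # S110 state tracking
        for window, (value, score) in belief.items():
            if score >= THRESHOLD:
                state[window] = value
    if collector.confirm(state):                         # S112 final confirmation
        executor.execute(state)                          # S114 execute the task
```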
Conventional discriminative dialogue state tracking
The discriminative dialogue state tracking approach calculates the belief in each state using a learned conditional model that directly represents the belief b(s_{t+1}) = p(s_{t+1} | s_t, z_t). Maximum entropy is widely used for the discriminative approach, which formulates the belief in the state as follows:

b(s) = P(s | x) = η · exp(wᵀφ(x))   (1)

where η is a normalization constant, x = (d_1^u, d_1^s, s_1, ..., d_t^u, d_t^s, s_t) is the history of the user dialog acts d_i^u, i ∈ {1, ..., t}, the system dialog acts d_i^s, i ∈ {1, ..., t}, and the sequence of states up to the current dialogue turn at time t, φ(·) is a vector of feature functions on x, w is the set of model parameters to be learned from dialog data, and T denotes the transpose. According to this formulation, the posterior computation must be performed over all possible state realizations in order to obtain the normalization constant η. This is usually not possible for real dialog domains, which can have a large number of variables and possible variable values. Therefore, for the discriminative approach to be tractable, the size of the state space is generally reduced. For example, one approach is to limit the set of possible state variables to those that appeared in the NLU results. See, Metallinou, et al., "Discriminative state tracking for spoken dialog systems," Proc. Association for Computational Linguistics, pp 466-475, 2013. Another approach assumes conditional independence between dialogue state components to address scalability, and uses conditional random fields. See, Lee, et al., "Unsupervised Spoken Language Understanding for a Multi-Domain Dialog System," IEEE Trans. on Audio, Speech & Language Processing, 21(11):2451-2464, 2013. Deep neural networks operating on a sliding window of utterance features extracted from previous user turns have also been suggested. See, Henderson, et al., "Word-based dialog state tracking with recurrent neural networks," Proc. SIGDIAL, pp 296-299, 2014.
Conventional generative dialogue state tracking
The conventional generative approach to dialogue state tracking calculates the belief in each state using the Bayes rule, from the belief of the previous turn b(s_{t-1}) and the likelihood of the user utterance hypotheses p(z_t | s_t). In Williams, et al., "Factored partially observable Markov decision processes for dialogue management," 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, pp 76-82, 2005, this probability is factorized and some independence assumptions are made:

b(s_t) ∝ (p(z_t | s_t) / p(z_t)) · Σ_{s_{t-1}} p(s_t | d_{t-1}^s, s_{t-1}) · b(s_{t-1} | h_{t-1})   (2)

where p(s_t | d_{t-1}^s, s_{t-1}) is the probability of the state at time t given the system dialogue act and the state at time t-1, p(z_t | s_t) is the probability of the user utterance z_t (for example, its bag-of-words representation) given the state at time t, b(s_{t-1} | h_{t-1}) is the belief in the state at time t-1 given h_{t-1} (a set of features describing the dialogue from t = 0 to t-1), and p(z_t) is the probability of the user's utterance (for example, its bag-of-words representation). A typical generative modeling of a dialog state tracking process uses a factorial hidden Markov model. See Ghahramani, et al., "Factorial Hidden Markov Models," Machine Learning, 29(2-3):245-273, 1997. In this family of approaches, scalability is often a problem. One way to reduce the amount of computation is to group the states into partitions, as proposed in the Hidden Information State (HIS) model. See, Gasic, et al., "Effective handling of dialogue state in the hidden information state POMDP-based dialogue manager," J. ACM Trans.
on Speech and Language Processing (TSLP), 7(3), Article 4, 2011. Another approach for dealing with the problem of scalability in this type of dialogue state tracking is to adopt factored dynamic Bayesian networks, through conditional independence assumptions between dialogue state components, and by using approximate inference algorithms such as loopy belief propagation or blocked Gibbs sampling. See Thomson, et al., "Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems," Computer Speech & Language, 24(4):562-588, 2010; and Raux, et al., Interspeech, pp 801-804, 2011.
A decompositional model for tracking hidden temporal variables
The exemplary decompositional model 50 and the learning and prediction procedures are now described in more detail. The method provides a generative/discriminative compromise by selecting a generative model to make predictions, but using a discriminative type of approach to model learning, due to the choice of linear factors to conditionally link the variables that make up the dialog state. This method combines the precision of a discriminative model with the expressivity of a generative model. In one embodiment, the parameter learning procedure may be treated as a Ridge-regularized matrix decomposition task solved by the alternating least squares method, or by another suitable matrix decomposition method, such as stochastic gradient descent or proximal gradient methods. The Ridge regression method allows asymmetric penalization of one or more of the target variables that state tracking identifies. Figure 3 illustrates the underlying probabilistic graphical model defining the spectral state tracking approach as a directed model of latent variables A, B, and C. Figure 4 illustrates the corresponding factor graph. In this model, the three factors are linear: φ1: p(s_{t+1} | A, B_{s_{t+1}}) = AᵀB_{s_{t+1}}, φ2: p(s_t | A, B_{s_t}) = AᵀB_{s_t}, and φ3: p(z_t | A, C) = AᵀC, where B_{s_t} is the column of the matrix B which corresponds to the embedding of s_t and B_{s_{t+1}} is the column of B which corresponds to the embedding of the variables of s_{t+1}. In the probabilistic graphical models of Figure 3 and Figure 4, K represents the number of descriptors making up an observation, N is the number of transition examples in the training dataset, and M is the number of descriptors, also called variables, describing the state to be tracked.
Learning method (S102)
In the following, the terms "optimization", "minimization" and similar phraseology should be interpreted broadly, as one skilled in the art would understand these terms. For example, these terms should not be interpreted as being limited to the absolute global optimum, an absolute minimum, and so on. For example, minimizing a function may use an iterative minimization algorithm that terminates with a stopping criterion before an absolute minimum is reached. It is also contemplated that the optimum or minimum value be a local optimum or local minimum value. Figure 5 illustrates an exemplary learning method. At S202, a set of annotated transactional conversations is provided for the current transaction type (for example, restaurant reservation in the case of the example). More precisely, for each turn of each of these conversations, the ground-truth state is provided.
For example, at a given turn, if the customer says, "I want to eat in Barcelona," the turn is annotated with an update to the location, specifying the location at the end (time t+1) of the turn as Barcelona, the other state variables being the same as those at the beginning of the turn (time t). At S204, matrices A, B, and C are defined. Figure 6 illustrates the collective matrix factorization task of the non-parametric learning procedure of the state tracking model. For the sake of simplicity, the matrices B and C are concatenated to form a single matrix D, and M is the concatenation of the matrices holding the initial state descriptors, the final state descriptors, and the turn representations. In this step, the matrix M is filled with the data of the turns. More precisely, each row comprises the known values of each of the state variables before and after the transition, and the corresponding dialogue representation 66. The matrix M thus comprises, for each of a plurality of rows, each row corresponding to one turn: initial and final state descriptors, i.e., values for each of the state variables at time t and corresponding values for each of the state variables at time t+1, and the observation Z_t, formed from what was said at time t, which is presumed to be responsible for the transition from the state s_t to the state s_{t+1}. Thus, for example, the respective row of the matrix Z_t consists of a bag-of-words representation of the words that were said between the state at time t and the state at time t+1 (and/or other features extracted from the utterance), for example in the form of a 1 and 0 encoding of each of the set of possible words/features. The number of rows of the matrix M present during the learning step may be at least 20, or at least 100, or at least 1000, or more. The first latent variable matrix A is instantiated with one row for each row of the matrix M, and the matrices B and C are instantiated with one column for each of the possible state variable values and one column for each of the dialogue representation features. At this point, matrices A, B and C are empty and can be initialized with random or otherwise generated initial values. At S206, a decomposition of the matrix M into the first and second matrices A and D is learned, for example, by an approximation method such as the alternating least squares method. In this step, the embeddings of the transitions (in the matrix A) and of each of the state variable values and each of the features of the dialogue representation (in the matrix D) are learned together. Accordingly, each row of the matrix A and each column of the matrix D includes a set of hidden variables. The matrices A, B and C are low-rank matrices. In the latent variable matrix A, each row (embedding) represents a respective embedding of a transition filling the corresponding row of the matrix M (state at time t, state at time t+1 and the representation of the user dialogue act). Each row thus comprises a set of latent variables (illustrated by dots in Figure 6), which constitute the embedding of one of the transitions. The number of latent variables (columns of matrix A) is fixed and can be, for example, at least 5, or at least 10, or at least 50, or at least 100, and can be as high as 1,000 or 10,000. The latent variable matrix B comprises the embeddings of the variables, which represent the embeddings of each possible state variable value at time t and at time t+1. Each column corresponds to a (state variable, value) pair which has a precise value (0/1 coding), and the latent variables in this column constitute the embedding of this pair.
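A minimal sketch of how one row of M could be assembled from a turn's annotation, under the encoding just described. The variable names, vocabularies and turn data below are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Hypothetical encoding helpers for one row of M = [s_t one-hot | s_{t+1} one-hot | z_t bag-of-words].
STATE_VALUES = [("location", "rome"), ("location", "paris"),
                ("type", "spanish"), ("type", "italian"),
                ("time", "19:30"), ("time", "20:00")]
VOCABULARY = ["eat", "rome", "spanish", "book", "table"]

def one_hot_state(state: dict) -> np.ndarray:
    return np.array([1.0 if state.get(var) == val else 0.0
                     for var, val in STATE_VALUES])

def bag_of_words(utterances: str) -> np.ndarray:
    words = utterances.lower().split()
    return np.array([1.0 if w in words else 0.0 for w in VOCABULARY])

def build_row(state_t: dict, state_t1: dict, turn_text: str) -> np.ndarray:
    return np.concatenate([one_hot_state(state_t),
                           one_hot_state(state_t1),
                           bag_of_words(turn_text)])

row = build_row({}, {"location": "rome"}, "I want to eat in Rome")
print(row)  # 6 + 6 state features followed by 5 word features
```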
The latent variable matrix C includes the embeddings of the bag-of-words representations. In matrix C, each column represents one of the set of words (and/or other features) in the bag of words/other features. The number of latent variables (rows) in matrices B and C is fixed, and can be, for example, at least 5, or at least 10, or at least 50, or at least 100, and can be as high as 1,000 or 10,000. Since the two matrices B and C are concatenated to form matrix D, they have the same number of latent variables. Matrices A and D have the same number of latent variables. The more latent variables are used, the easier it is to reconstruct an approximation of the matrix M as the product of the latent variable matrices A and D. However, this comes with an increased computational cost and can lead to poorer prediction performance as the number of variables grows. The optimal number of latent variables can depend on the dataset and can be selected by comparing the system performance obtained with different values. At S208, the latent variable matrices A and D are stored. Equation 3 defines the exemplary optimization task performed in step S206, namely the loss function associated with learning the latent variables {A, D}:

L(A, D) = ||(M − AD)W||²_F + λ_a||A||²_F + λ_d||D||²_F   (3)

that is, a function of the minimum of the difference between the real matrix M and the product of the latent variable matrices A and D, conditioned by the weights W, and where {λ_a, λ_d} ∈ ℝ² are optional regularization hyper-parameters (scalar values) that can be learned by cross-validation, and W is a diagonal matrix that increases the weight of some of the state variables s_{t+1}, in order to bias the results toward better predictive accuracy for these specific variables. The weight matrix can be learned by cross-validation. The weights are chosen to improve the agreement between the reconstructed matrix M' (formed as the product of the matrices A and D) and the real matrix. This type of weighting approach has been shown to be effective in other types of generative/discriminative task compromises. See, for example, Ulusoy, et al., "Comparison of generative and discriminative techniques for object detection and classification," Toward Category-Level Object Recognition, pp 173-195, 2006; and Bishop, et al., "Generative or discriminative? Getting the best of both worlds," BAYESIAN STATISTICS, 8:3-24, 2007. ||·||_F denotes the Frobenius norm of the respective matrix (the square root of the sum of the absolute squares of its elements). However, other matrix norms can be used. In the exemplary embodiment, the weight matrix has a greater impact on at least some of the final state descriptor features of the reconstructed matrix M than on the corresponding features of the initial state descriptor. To perform the minimization task represented in Equation 3, a matrix decomposition method can be used, such as the alternating least squares method, which is a sequence of two convex optimization problems. In a first step, for a known D, the matrix A that minimizes Equation 4 is computed:

A* = argmin_A ||(M − AD)W||²_F + λ_a||A||²_F   (4)

Then, for a known A, the matrix D that minimizes Equation 5 is computed:

D* = argmin_D ||(M − AD)W||²_F + λ_d||D||²_F   (5)

At the beginning, the matrix A (and/or D) can be initialized with random values or with a singular value decomposition of the matrix M.
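A small numerical sketch of the weighted reconstruction loss of Equation 3, assuming the squared Frobenius norm; the matrix shapes, weights and random data are purely illustrative.

```python
import numpy as np

def weighted_cmf_loss(M, A, D, W, lam_a=0.1, lam_d=0.1):
    """Loss of Eq. 3: ||(M - A·D)W||_F^2 + lam_a*||A||_F^2 + lam_d*||D||_F^2.

    W is a diagonal matrix that up-weights the s_{t+1} columns of M."""
    residual = (M - A @ D) @ W
    return (np.sum(residual ** 2)
            + lam_a * np.sum(A ** 2)
            + lam_d * np.sum(D ** 2))

rng = np.random.default_rng(0)
n_rows, n_feats, k = 8, 17, 4          # 8 transitions, 17 columns of M, rank-4 embeddings
M = rng.integers(0, 2, size=(n_rows, n_feats)).astype(float)
A = rng.normal(size=(n_rows, k))
D = rng.normal(size=(k, n_feats))
W = np.diag(np.concatenate([np.ones(6), 2.0 * np.ones(6), np.ones(5)]))  # up-weight the s_{t+1} block
print(weighted_cmf_loss(M, A, D, W))
```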
By iteratively solving the two optimization problems, the fixed-point regularized ridge regression forms of the weighted alternating least squares algorithm are obtained:

$$A \leftarrow M\,W\,D^\top \big(D\,W\,D^\top + \lambda_a I\big)^{-1} \qquad (6)$$

$$D \leftarrow \big(A^\top A + \lambda_d I\big)^{-1} A^\top M \qquad (7)$$

where I is the identity matrix. As shown in Equation 6, the matrix W is used only for the update of A, because only the columns of D, representing the state features, are weighted differently. For the optimization of the embeddings of D, shown in Equation 7, each dialogue-session embedding stored in A has the same weight, so in this second step of the algorithm W is in fact an identity matrix and therefore does not appear. More precisely, the state variables are a concatenated zero-one coding of the possible values of each variable. When a new observation z_t is received at time t, the posterior distribution of each of the state variables at time t+1 is predicted, given s_t and the latent variable matrices A and D. The prediction step entails (1) computing the embedding of the current transition by solving the corresponding least squares problem based on the two variables {s_t, z_t}, which constitute the current knowledge of the state at time t and the bag of words of the last turn composed of the system and user utterances (generating a new row of the matrix A); and (2) estimating the missing values of interest, i.e., the probability of each value of each of the variables constituting the state s_{t+1} at time t+1, by computing the dot product between the transition embedding computed in (1) and the corresponding column embeddings of the matrix D, which yields a score for each value of each variable of s_{t+1}. More precisely, the decomposition can be written in the following way:

$$M = A\,D \qquad (8)$$

where M is the data matrix used to perform the decomposition. M includes one row for each transition. As indicated above, A has one row for each transition embedding, and D has one column for each variable-value embedding in a zero-one encoding. When a new row of observations m_t is received for a new set of variables for the state s_t and the observations z_t, and D is fixed, the purpose of the prediction task is to find the corresponding row a_t of A such that:

$$a_t^\top D \approx m_t^\top \qquad (9)$$

It is generally difficult to require that they be equal, but it can be required that they have the same projection in the latent space:

$$a_t^\top D\,D^\top = m_t^\top D^\top \qquad (10)$$

Then the classical closed-form solution of a linear regression task can be computed as:

$$a_t^\top = m_t^\top D^\top \big(D\,D^\top\big)^{-1} \qquad (11)$$

$$a_t = \big(D\,D^\top\big)^{-1} D\,m_t \qquad (12)$$

This formula is in fact the optimal value of the embedding of the transition m_t under the hypothesis of a quadratic loss. Otherwise, for example in the case of a logistic loss, it is an approximation. As will be appreciated, although Equation 11 involves the matrix inversion $(DD^\top)^{-1}$, the inversion is only of a matrix of small dimension (the dimension of the embeddings). Thus, given m_t (which includes only the values of s_t and z_t), in step (1) the embedding is computed using Equation 12. Then, in step (2), the missing values s_{t+1} are computed by multiplying a_t by only those columns of the matrix D (i.e., the columns originating from B) that correspond to the embeddings of the state features s_{t+1}. The prediction output is the distribution over the values of each of the state variables s_{t+1} at time t+1.
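Under the same assumptions, the alternating updates of Equations 6-7 and the prediction of Equations 11-12 might be sketched as follows; this is a didactic outline under the toy setting above, not the patented implementation, and the helper names are hypothetical.

```python
import numpy as np

def als_step(M, A, D, w, lam_a=0.1, lam_d=0.1):
    """One pass of weighted alternating ridge regression (Eqs. 6-7).
    M: n x m data matrix, A: n x k, D: k x m, w: length-m column weights."""
    k = A.shape[1]
    W = np.diag(w)
    # Eq. 6: update A; the column weights W emphasize the s_{t+1} features.
    A = M @ W @ D.T @ np.linalg.inv(D @ W @ D.T + lam_a * np.eye(k))
    # Eq. 7: update D; here W reduces to the identity and is omitted.
    D = np.linalg.inv(A.T @ A + lam_d * np.eye(k)) @ A.T @ M
    return A, D

def predict_next_state(m_t, D, final_state_cols):
    """Prediction step (Eqs. 11-12): embed the partial row m_t (only the s_t and
    z_t blocks are filled, the s_{t+1} block is zero), then score the s_{t+1} columns."""
    a_t = np.linalg.solve(D @ D.T, D @ m_t)          # Eq. 12
    scores = D[:, final_state_cols].T @ a_t          # one score per (slot, value) pair
    return scores
```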
This distribution, or the most likely values, can be propagated to the next row of the matrix as s_t, and the prediction method can be iterated at each interaction with the client until the predefined variables have been instantiated. As will be understood, at some iterations there may be no change to the state variables. Several advantages can be identified in this approach. First, at learning time, alternating ridge regression is computationally efficient, because a closed-form solution exists at each stage of the optimization process used to derive the parameters of the model, that is, its low-rank matrices. Second, at decision time, the state tracking procedure involves (1) computing the embedding a_t of the current transition using the current state estimate and the current observation z_t, and (2) computing the distribution over the state, defined as a vector-matrix product between a_t and the latent matrix D. Figure 7 illustrates the generation of the matrix M at prediction time, i.e., after learning the row and column embeddings forming the matrices A and D. The system first establishes the client's intention to find a restaurant (S106). The intent can be deduced from the user having logged on to an associated restaurant reservation site, and/or through simple questions (e.g., "to book a restaurant, say YES or press 1 now", ... (wait for the customer's answer), ... "to book a hotel, say YES or press 2 now"). The system then instantiates the correct matrix M with the appropriate set of slots. For ease of illustration, it is assumed that there are only three variables, location, type, and time, for the restaurant reservation matrix M, and only two values for each variable. There is also a column for each of a set of words that the system is configured to recognize in the bag-of-words representation 66 of the system and client utterances 64, 62 for a given turn. In a first turn (turn 1), the information gathering component 36 generates a system utterance 64 asking where the customer wants to eat, and the customer responds. The information gathering component 36 analyzes the client utterance 62. The representation generation component 40 identifies a bag of words from the output of the information gathering component, which includes the words eat and Rome. The slot filling component 38 fills the values s_t and the bag-of-words representation into a row of the matrix. The dialog state tracker prediction component predicts the s_{t+1} values of that row, using the learned latent variable matrices A and D. These values become the s_t values for the next turn. In this turn, the user's utterance was not recognized with sufficient confidence, and so in the next turn the information gathering component asks whether the customer said Spanish, which the customer confirms, and the prediction is repeated. Once a value has been instantiated for each variable, the system confirms the client's goal (turn 5) before proceeding with the execution of the transaction, which may include finding, in a database, Spanish restaurants in Rome that have a table available at 8 pm, and presenting a list to the client for review and selection. The system can then make a restaurant reservation for the customer, based on the customer's selection. The method illustrated in Figures 2 and 4 can be implemented in software that can be run on a computer.
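By way of illustration, the turn-by-turn tracking loop just described could be summarized in software roughly as follows, reusing the hypothetical helpers of the earlier sketches (bag_of_words, predict_next_state); the loop simply re-embeds each new turn and propagates the predicted scores as the belief for the next turn.

```python
import numpy as np

def track_dialogue(utterances, D, final_state_cols, n_state_features):
    """Iterate the prediction turn by turn, propagating the predicted scores
    for s_{t+1} as the belief s_t of the next turn."""
    s_t = np.zeros(n_state_features)                # nothing observed yet ("None")
    for utterance in utterances:
        z_t = bag_of_words(utterance)               # observation for this turn
        m_t = np.concatenate([s_t,
                              np.zeros(n_state_features),  # unknown s_{t+1} block
                              z_t])
        scores = predict_next_state(m_t, D, final_state_cols)
        s_t = scores                                # belief propagated to next turn
    return s_t
```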
The software may be provided on a non-transitory computer-readable recording medium on which a control program is recorded (stored), such as a disk, a hard drive, or the like. Typical forms of non-transitory computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape or any other magnetic storage medium, a CD-ROM, DVD or any other optical medium, a RAM, a PROM, an EPROM, a FLASH-EPROM or other memory chip or cartridge, or any other non-transitory medium that a computer can read and use. The software may be integral with the computer 18 (for example, an internal hard drive or RAM), or may be separate (for example, an external hard drive operatively connected to the computer 18), or may be separate and accessible via a digital data network such as a local area network (LAN) or the Internet (for example, a redundant array of inexpensive or independent disks (RAID) or other network server storage that is indirectly accessible by the computer 18 via a digital network). Alternatively, the method may be implemented in transitory media, such as a transmittable medium in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, for example those produced during radio-wave and infrared data communications, and the like. The exemplary method may be implemented on one or more general purpose computers, special purpose computers, a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, graphics processing unit (GPU), or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the flowchart shown in Figures 2 and/or 4 can be used to implement the method. As will be appreciated, while the steps of the method may all be implemented on a computer, in some embodiments one or more of the steps may be performed at least partially manually. As will also be appreciated, the steps of the method need not all be carried out in the order illustrated, and fewer, more, or different steps may be performed. The exemplary method and system have advantages over existing systems in (1) producing a joint probabilistic model of the transitions of the hidden variables composing a given dialogue state and of the observations, which makes it possible to track the current belief about the user's goals while explicitly taking into account the possible interdependencies between state variables; and (2) providing a computational framework, based on collective matrix factorization, to efficiently infer the distribution over the state variables in order to derive a proper information-seeking dialogue policy in such a context. While transactional dialogue state tracking is especially useful in the context of autonomous dialogue management, the system can also be applied to dialogue machine reading and to knowledge extraction from human-to-human dialogue corpora. Without intending to limit the scope of the exemplary embodiment, the following examples illustrate the application of the method to a pre-existing dataset.

EXAMPLES

First, the dialogue domain used for the evaluation of the dialogue state tracker and the component probability models used for the domain are described.
Then, a set of experimental results obtained with the tracker and a comparison with existing trackers are described. The evaluation uses the DSTC-2 dialogue domain described in Williams, et al., "The Dialog State Tracking Challenge," Proc. SIGDIAL 2013 Conf., pp 404-413, 2013. In this domain, the user queries a database of local restaurants. The restaurant information domain dataset was collected using Amazon Mechanical Turk. A typical dialogue proceeds as follows: first, the user specifies a set of personal constraints concerning the restaurant; then the system offers the name of a restaurant that satisfies the constraints; the user then accepts the offer and may request additional information about the accepted restaurant; the dialogue ends when all the information requested by the user has been provided. In this context, the dialog state tracker must be able to track the three slots that make up the dialog state: geographic area, cuisine type, and price range. As will be appreciated, the tracker can readily be configured to track other variables as well, provided they are fully specified. The dialogue state tracker updates the belief at each turn, receiving as evidence the output of the NLU module once the actual utterance has been generated by the user. In this experiment, the output of the NLU module was limited to a bag-of-words representation of each user utterance, in order to be comparable with existing state tracking approaches that use only this information as evidence. The task of the dialogue state tracker 32 is to generate a set of possible states and their confidence scores for each slot, the confidence score corresponding to the posterior probability of each state variable. In addition, the dialog state tracker also maintains a special state variable, called "None", which indicates that the true goal state has not yet been observed. Experimental state tracking results were obtained for this dataset and compared to existing generative and discriminative approaches. Table 1 gives the variables and the value domain for each of them. TABLE 1: Information slots for DSTC-2 (restaurant information domain). Table 2 details the precision-at-position (P@n) results obtained on the DSTC-2 dataset for a set of embedding dimensions of the collective matrix factorization model. The model is able to accurately determine a small subset of hypotheses in which the correct instantiation is present for each variable. Then, for comparison with existing methods, Table 3 presents the accuracy of the best CMF model, with an embedding dimension of 100, where the value of each slot is instantiated as the most likely value according to the inference procedure described above. Results are also reported for several existing generative and discriminative state tracking methods on this dataset. Specifically, as is standard for these approaches, the accuracy score evaluates $p(s_{t+1} \mid s_t, z_t)$.
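For reference, the precision-at-position measure (P@n) reported in Table 2 can be computed per slot from the ranked value hypotheses roughly as follows; this is a sketch of the metric only, with hypothetical names, and the official DSTC-2 scoring scripts remain the authoritative implementation.

```python
def precision_at_n(ranked_values, true_value, n):
    """1.0 if the correct slot value appears among the top-n ranked hypotheses."""
    return 1.0 if true_value in ranked_values[:n] else 0.0

def mean_p_at_n(ranked_lists, true_values, n):
    """Average P@n over the turns of the test set for one slot."""
    hits = [precision_at_n(r, t, n) for r, t in zip(ranked_lists, true_values)]
    return sum(hits) / len(hits)

# Example for the "area" slot, hypotheses ranked by the scores of the prediction step:
# mean_p_at_n([["north", "centre"], ["south", "north"]], ["centre", "south"], 2) -> 1.0
```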
The following existing trackers were compared: 1. A rule-based system described in Zilka, et al., "Comparison of Bayesian Discriminative and Generative Models for Dialogue State Tracking," Proc. SIGDIAL 2013, pp 452-456, 2013. 2. An HMM model (HWU) described in Wang, "HWU baseline belief tracker for DSTC 2 & 3," Technical Report, Heriot-Watt University, 2013. 3. A modified HMM model (HWU+) described in Wang 2013. 4. A maximum entropy model (MaxEnt), a type of discriminative model, described in Lee, et al., "Recipe for building robust spoken dialog state trackers: Dialog state tracking challenge system description," Proc. SIGDIAL 2013, pp 414-422, 2013. 5. A deep neural network (DNN) architecture as described in Henderson, et al., "Word-Based Dialog State Tracking with Recurrent Neural Networks," Proc. SIGDIAL 2014, pp 296-299, 2014. 6. CMF, the present method, where the number following the name indicates the size of the embedding vectors; thus, for example, CMF-100 indicates that the embedding matrix A has 100 latent variables in each row and the matrices B and C have 100 latent variables in each column. Table 2: Precision of the proposed model versus other trackers on the DSTC-2 dataset. Table 3: 10-fold cross-validation accuracy results obtained for each of the three variables tracked simultaneously (area, cuisine type, and price range) on the DSTC-2 dataset. The results suggest that the exemplary system and method allow for efficient dialogue state tracking in the transactional context of autonomous dialogue systems. The results also suggest that customer-care services, and more broadly conversational-agent automation platforms, will be able to handle dialogue management more efficiently using the system and method. Indeed, processes related to contracting, billing, and insurance management can be automated using such a framework. The system and method are applicable to any dialogue domain that can be formalized as a slot-filling task. More specifically, such a system allows efficient tracking of the hidden variables defining the user's goal in a task-oriented dialogue using any type of evidence available, from a bag-of-words representation of the utterance to the output of a natural language understanding (NLU) module.
Claims

1. A dialog state tracking computer system (10) comprising: a memory (12) which stores first and second latent variable models that have been learned by reconstruction of a decompositional model, the decompositional model having been generated from annotated training dialogues and including, for each of a plurality of dialog state transitions, state descriptors for the initial and final states of the transition and a respective conversation representation, the first learned latent variable model including embeddings of the plurality of state transitions, and the second learned latent variable model including feature embeddings of the state descriptors and feature embeddings of the dialogue representations; an information collection component (36) that receives a user's utterance for each of a plurality of new dialog state transitions; a representation generation component (40) that generates a dialogue representation based on the user's utterance; a prediction component (32) that predicts a state descriptor for a final state of each new dialogue state transition using the learned first and second latent variable models, an initial state descriptor, and the dialogue representation; and a processor that implements the information collection component, the representation generation component, and the prediction component.

2. The computer system of claim 1, further comprising a learning component for learning the first and second latent variable models.
|